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0^ ' An exact mapping is established between sequence alignment, one of the most commonly used 

tools of computational biology, and the asymmetric exclusion process, one of the few exactly solvable 
' . models of nonequilibrium physics. The statistical significance of sequence alignments is characterized 

^ ' through studying the total hopping current of the discrete time and space version of the asymmetric 

O . exclusion process. 
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I. INTRODUCTION 



^ ■ 

o ■ 

^ , Sequence alignment is one of the most commonly used computational tools of molecular biology. Its applications 
S ' range from the identification of the function of newly sequenced genes to the construction of phylogenic trees [|l|,|j . 
■fij , Beyond its practical importance, it is one of the simplest model systems for pattern matching. In computational 
^ biology, sequences are routinely compared via a transfer matrix algorithm to find the "optimal" alignment. Recently, 
c/2 it has been noted that this transfer matrix algorithm is the same as the one used to calculate the partition function 
or optimal energy of a directed polymer in a random medium Q . This problem is known to belong to the universality 
class of surface growth as described by the Kardar-Parisi-Zhang (KPZ) equation ji). From the assignment of sequence 
alignment to the KPZ universality class various scaling laws characterizing sequence alignment have been deduced. 
They have been used in order to answer questions of practical importance to sequence alignment, e.g., the optimal 
^ ' choice of alignment parameters |5|-|^. But there are also non universal features which are of great importance for 
Q practical applications. They cannot be extracted from the knowledge of the universality class alone, but have to be 
O [ evaluated by a microscopic study taking into account all the details of the given sequence alignment algorithm. In 
this paper, we will perform such a study for a certain choice of parameters for which sequence alignment maps exactly 
_j ' onto the asymmetric exclusion process MM, which is the best studied nonequilibrium system of the KPZ universality 



^ I class, equivalent also to the six vertex model llO 11 



' We will apply this mapping to address the central question in the biological application of sequence alignment, 
00 I namely the assessment of alignment significance: The problem is that an "optimal" alignment, i.e., the best possible 
• alignment of two given sequences according to some scoring function, does not necessarily reflect sequence homology. 
^ I ' A sequence alignment algorithm will produce an "optimal" alignment for any pair of sequences, including randomly 
Q.^ 1 chosen ones. The important question is whether the alignment produced reflects an underlying similarity of the two 
I sequences compared. A common way to address this question is to evaluate the probability of getting a certain 
"Y"; 1 alignment score by chance. This requires the knowledge of the distribution of alignment scores for random sequences. 

This distribution turns out to obey a universal (Gumbel) form with two non-universal parameters. In this paper, 
^ , we will derive the Gumbel distribution and characterize some of its properties by relating them to the corresponding 
I ' asymmetric exclusion process. In particular, we show how the tail of this distribution can be obtained from the 
■ O [ generating function for the total number of hopped particles. The latter is also the generating function of the average 
surface height in the equivalent surface growth formulation of the asymmetric exclusion process. This important 
quantity has been calculated for the case of continuous time and continuous space using the replica trick |lj] a long 
time ago. More recently, it has been obtained for the case of continuous time and discrete space in the scaling 
regime p^ . Here, we will calculate this quantity in discrete time and discrete space as necessary for the mapping to 
sequence alignment, in the asymptotic large size limit which is beyond the scaling limit. Our calculation does not make 
■ use of the replica trick and leads to a very simple closed form expression. It explicitly contains the anomalous t^^^ 
[ scaling of the surface height fluctuations of KPZ surface growth in one dimension. We use this generating function 
to give an explicit expression for the significance of sequence alignments. 

The paper is organized as follows: First, we will give a self-contained introduction to sequence alignment in Sec. ||. 
This familiarizes the reader with the sequence alignment algorithm and gives us a chance to develop the notations to 



be used later. In Sec. [II, we will reduce the problem of assessing the statistical significance of the widely used local 
alignment to a quantity defined in terms of the simpler global alignment. Readers more interested in the properties 
of the discrete asymmetric exclusion process can skip these two sections and go directly to Sec. which describes 
the simplest version of the global alignment problem. Here, the exact mapping to the asymmetric exclusion process 



in discrete time and space with sublattice-parallel updating is described. Sec. ^is devoted to the calculation of 
the generating function of interest for the asymmetric exclusion process. In Sec. V|, we discuss the result obtai ned , 
apply it to the assessment of alignment significance, and verify the analytical predictions numerically. In Sec. VI], 
we consider more general scoring systems and map them onto generalized asymmetric exclusion processes. The final 
section gives a short summary of the paper and points towards several future directions. A number of technical details 
are given in the appendices. 

II. REVIEW OF SEQUENCE ALIGNMENT 
A. Gapless Alignment 

Sequence alignment algorithms come in different levels of sophistication. The simplest alignment algorithm is 
gapless alignment. It is not only extremely fast but also very well understood theoretically. Thus, it has been very 



widely used, e.g., in its implementation of the program BLAST |14 . 

Gapless alignment looks for similarities between two sequences a = {aia2 ■ ■ ■ flAf}, and b = {6162 • • • ^jv} of length 
M and N ^ M respectively. The letters and hj are taken from an alphabet of size c. This may be the four letter 
alphabet {A, C, G, T} of DNA sequences or the twenty letter alphabet of protein sequences with the letters distributed 
according to the natural frequencies of the twenty amino acids. A local gapless alignment A of these two sequences 
consists of a substring a^-^+i . . . ai-itti of length ^ of sequence a and a substring . . . bj^ibj of sequence b of the 

same length. Each such alignment is assigned a score 

5[^] = 5(»,j,^) = ^s,,_„,,_,, (1) 

fc=0 

where Sa,b is some given "scoring matrix" measuring the mutual degree of similarity of the different letters of the 
alphabet. A simple example of such a scoring matrix is the match-mismatch matrix 

'^■^ ={-f,l^b (2) 

which is used for DNA sequence comparisons For protein sequences, the more complicated 20 x 20 PAM or 
BLOSUM matrices jl^ are used to account for the variable degrees of similarity (e.g., hydrophobicity, size) among 
the 20 amino acids. The computational task is to find the i, j, and i which give the highest total score 

T. = max S[A] (3) 

for a given scoring matrix Sa,b- 

The optimization task called for in gapless alignment can be easily accomplished by introducing an auxiliary 
quantity, Sij, which is the optimal score of the above consecutive subsequences ending at (optimized over i.) It 
can be conveniently calculated in 0{N'^) instead of the expected 0{N^) steps using the transfer matrix algorithm 

S'ij = max{S'i_i j_i + Sa,.,b, , 0}, (4) 

with the initial condition Sa^k = S'fe,o- This recursion equation reflects that for a given (j,j) the optimal £ is 
either zero or larger than zero. If the optimal £ is zero the corresponding score is zero as well. If the optimal £ is at 
least one, the pair {ai,bj) certainly belongs to the optimal alignment together with whatever has been chosen to be 
optimal up to the point (i — l,j ~ 1). Eq. is basically a random walk with increments Sa.t which is cut off if it 
falls below zero. The global optimal score is obtained as 

E = max Si ,-. (5) 

l<i<MS<j<N '■' 

In order to characterize the statistical significance of the alignment, it is necessary to know the distribution of 
E for gapless alignments of two random sequences, whose elements a^s are generated independently from the same 
frequencies Pa as the query sequences, and scored with the same matrix Sa.b- This distribution of E has been worked 
out rigorously |p^ , |l9| . For suitable scoring parameters, it is a Gumbel or extreme value distribution given by 

Pr{E <S}^ exp{~Ke-^^). (6) 



2 



This distribution is characterized by the two parameters A and k with A giving the tail of the distribution and A"-'^ log k 
describing the mode. For gapless alignment, these non universal parameters can be explicitly calculated ||l^ , |l9|] from 
the scoring matrix Sa,b and the letter frequencies Pa- For example, A is the unique positive solution of the equation 



(exp(As)) = y^^PaPb exp(Asa^b) = 1. 



(7) 



The other parameter k is given hy k = KMN , where if is a more complicated function of the scoring matrix and the 
letter frequencies. Instead of reviewing the full derivation of the distribution Eq. and its parameters, we will below 
give some heuristic arguments which yield the known result. These can later be generalized to the more relevant case 
of alignment with gaps. 

For random sequences, one can take j = z in (|^) without loss of generality. Eq. (Q) then becomes a discrete Langevin 
equation, with 



S^,^ = S{t) = m&x{S{i - 1) + s{i), 0}, 
where the "noise" s{i) = Sa^b is uncorrelated and given by the distribution 

Pr{s, > s} = 



{a,b\sa,b>s} 



(8) 



(9) 



The dynamics of the evolution equation (g) can be in two distinct phases. The quantity which distinguishes these 
two phases is the expected local similarity score 



a, 6 



(10) 



If it is positive, the score S{i) will increase on average. After a while, it becomes positive enough that the maximum 
in Eq. ^ will never be given by the zero option. This option could thus be omitted which corresponds to global 
gapless alignment. The dynamics is then a random walk S{i) = S{i — 1) + s{i) with an average upward drift (s). 
The maximal score will be close to the end of the sequences and will be given by S « TV • (s). Since it is linear in 
the length of the sequences, this is called the linear phase of local alignment. It is obviously not suited to identify 
matches of subsequences^ and the distribution of the maximal score E is not an extreme value distribution. (It is just 
a sum of many independent local scores s{i) and therefore obeys a Gaussian distribution according to the central limit 
theorem.) 

The situation is dramatically different if (s) is negative. In this case the dynamics is qualitatively as follows: The 
score S{i) starts at zero. If the next local score s{i + 1) is negative — which is the more typical case in this regime — 
then S remains zero. But if the next local score is positive, then S will increase by that amount. Once it is positive, 
S{i) performs a random walk with independent increments s{i). Since (s) is negative, there is a negative drift which 
forces S{i) to eventually return to zero. After it is reset to zero, the whole process starts over again. The qualitative 
"temporal" behavior of the score S{i) is depicted in Fig. ^ 




FIG. 1. Sketch of the total score as a function of sequence position in gapless local alignment. 



From the figure, it is clear that the score landscape can be divided into a series of "islands" of positive scores, 
separated by "oceans" where S = 0. Each such island originates from a single jump out of the zero-score state and 
terminates when the zero-score state is reached again. Since each of these islands depends on a different subset of 
independent random numbers s(i), the islands are statistically independent of each other. If we let the maximal score 
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of the k island be ak, then these at are independent random variables. Calculating the probability for the maximum 
score ak of an island of length L in a saddle point approximation and optimizing over the length L of the islands, we 
asymptotically obtain a Poisson distribution 

PrK > a) « C,e-^^ (11) 

for the maximal island scores ak (see App. ^.) The parameter A which gives the typical scale of the maximal island 
score is given by the drift-diffusion balance of the underlying Brownian process. If the local scores s{i) were Gaussian 
variables with average u < and variance I?, this drift-diffusion balance would yield 

A = 2M. (12) 

For an arbitrary discrete or continuous distribution of the local scores s{i), it turns out to be given by the more 
general condition (^, which reduces to Eq. ( p^ ) in the limit (s) — > 0~ where the central limit theorem takes hold. 
Since the global optimal score E can be expressed by the maximal island scores as 

E = max{afc}, (13) 

k 

the distribution of S can be calculated from the distribution of the ak- The connection is covered by the theory of 
extremal statistics as developed by Gumbel ||2^;|2l[ . In the case of a large number K^, N oi independent island peak 
scores each of which asymptotically obeys the Poisson distribution Eq. (pi]), the connection is especially simple and 
we get 

Pr{E <S} = Pr{max{(Ti, ...,aK,} < S} = Fr{ai < S}^' (14) 
= (1 - ae"^^)^- « [exp(-ae-^^)]^* = exp(-Ate-^^) 

with K = C^K^, i.e., the parameter A of the island peak score distribution Eq. is the same as the parameter A in 
the Gumbel distribution Eq. (0) of the maximal alignment scores. 



B. Alignment with Gaps 

In order to detect weak similarities between sequences separated by a large evolutionary distance, "gaps" have to 
be allowed within an alignment to compensate for insertions or deletions occurred during the course of evolution [ p2| . 
Here, we will specifically consider Smith- Waterman local alignment In this case, a possible alignment A still 

consists of two substrings of the two original sequences a and b. But now, these subsequences may have different 
lengths, since gaps may be inserted in the alignment. For example the two subsequences GATGC and GCTC may be 
aligned as GATGC and GCT-C using one gap. Each such alignment A is assigned a score according to 

S[A]^ J2 Sa,b-SN^ (15) 

where the sum is taken over all pairs of aligned letters, A'g is the total number of gaps in the alignment, and S is 
an additional scoring parameter, the "gap cost" . In practice more complicated gap scores may be used, but we will 
concentrate on this version. 

The task of local alignment is again to find the alignment A with the highest score as in Eq. (||), in this enlarged 
class of possible alignments. This can be very efficiently done by a transfer matrix method which becomes obvious in 
the alignment path representation ]l5| . In this representation, the two sequences to be compared are written on the 
edges of a square lattice as the one shown in Fig. g where we chose for simplicity N = M. Each directed path on this 
lattice represents one possible alignment. The score of this alignment is the sum over the local scores of the traversed 
bonds. Diagonal bonds correspond to gaps and carry the score —S. Horizontal bonds are assigned the similarity scores 

s{r,t) = Sai,bj (16) 

where and bj are the letters of the two sequences belonging to the position (r, t) — (i — j,i + j — 1) as shown in 
Fig. |. 
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A 

FIG. 2. Local alignment of two sequences CGATGCT and TGCTCGA represented as a directed path on the alignment 
lattice: the diagonal bonds correspond to gaps in the alignment. The horizontal bonds represent aligned pairs. Alignments 
of identical letters (matches) are shown as solid lines; alignments of different letters (mismatches) are shown dashed. The 
highlighted alignment path r(t) corresponds to one possible alignment of two subsequences, GATGG to GCT — C. This path 
contains one gap. It is also shown how the coordinates r and t are used to identify the nodes of the lattice. 



If we were interested in finding the highest scoring global alignment of the two sequences a and 6, this corresponds 
to finding the best scoring path connecting the beginning (0, 0) with the end (0, 2A^) of the lattice. To find this path 
effectively, we define the auxiliary quantity h{r,t) to be the score of the best path ending in the lattice point (r,t). 
This quantity can be calculated by the Needleman-Wunsch transfer matrix algorithm [ p^ 

h{r,t+l) = ma,x{h{r,t-l) + s{r,t), h{r + l,t) ~ 6,h{r - l,t) - 6}. (17) 

This is easily recognized Q as the algorithm used to calculate the zero temperature configuration and energy of a 
directed polymer in a random potential given by the local scores s{r,t). The scores h{r,t) represent the (negative) 
energy of the optimally chosen polymer configuration ending in the point (r^t). Alternatively, the h{r,t) can also be 
interpreted as the spatial height profile of a growing surface through the well known relation between the directed 
polymer and the KPZ equation. 

If we are interested in local alignments, we can use the same trick as in the gapless case (|^). Cutting off unfavorable 
scores by adding the choice of zero in the maximum of Eq. ( [T^ ) leads to the Smith- Waterman algorithm p3t 



S(r,t + 1) = max < 



S{r,t-l) + s{r,t) 
S{r + l,t)-5 
S{r~l,t)~5 




(18) 



The score of the best local alignment is then given by 

E = max 5'(r,i). (19) 

In the presence of gaps, we can still distinguish a linear and a logarithmic phase. If the global alignment score tends 
to grow, the zero option of the local alignment algorithm does not play any role. We effectively revert to global 
alignment and get a maximum score which is linear in the length of the sequences. Contrary to gapless alignment, 
it is not enough to have a negative expectation value of the local scores (s) in order to prevent this. This is due to 
the fact that the alignment algorithm uses gaps to connect random stretches of good matches to optimize the score. 
The average score grows by a gap dependent amount u{{sa,b\i5) faster compared to the expectation value (s). The 
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log-linear transition occurs now at u{{safi},Sc) + (s) = 0. For the simple scoring system Eq. (||) this corresponds to 
a line SdfJ.) in the two dimensional space of the parameters fi and 6 shown in Fig. ^ Even for this simple scoring 
system, the loci of the phase transition are only known approximately for more complicated scoring systems, 
only numerical results are available. 
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FIG. 3. Loci of the log-linear phase transition for alignment with the scoring system Eq. (□) for an alphabet of c = 4 letters 
in terms of the mismatch cost ^ and the gap cost 5. Useful alignments can only be obtained in the logarithmic phase above 
the phase transition line. The diamonds are numerically estimated points on the phase transition line; the solid line is the 
approximate locus calculated in [^. Below the dashed line the alignments do not depend on the mismatch cost /i any more 
and the phase transition line is known to be strictly horizontal. 
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If the parameters are chosen such that u + (s) < 0, i.e., such that the expected global alignment score drifts 
downwards on average, then the average maximum score (S) is proportional to the logarithm of the sequence length 
as in the logarithmic phase of gapless alignment. The reduced value of (E) in the logarithmic phase makes it the regime 
of choice for the purpose of homology detection. Again, the distribution of S must be known for local alignments of 
random sequences in order to characterize the statistical significance of local alignment. There is no rigorous theory 
of this distribution in the presence of gaps. However, there is a lot of empirical evidence that the distribution is again 
of the Gumbel form |^ |30|| . The values of the parameters k and A are only known approximately for a few cases 
close to the gapless limit | |31| , ^ . In practice, they are determined empirically by time consuming simulations. Below 
we will present an explicit calculation of the parameter A for a simple scoring system. 



III. SIGNIFICANCE ESTIMATION USING GLOBAL ALIGNMENT 



As a first step, we want to show that the parameter A, which describes the tail of the Gumbel distribution, can be 
derived solely from studying the much simpler global alignment (]l7|). Later, we will see that global alignment is in 
certain cases equivalent to the asymmetric exclusion process. We will derive an explicit formula for A by studying the 
corresponding asymmetric exclusion process. 

Let us define the generating function 

Z(A; L) = {exp[Xh{r = 0, L)]) (20) 

where the brackets (•) denote the ensemble average over all possible realizations of the disorder, i.e., over all choices 
of random sequences a and b and /i(0, L) is the global alignment score at the end of a lattice of length L as shown in 
Fig.|^(a). It can be obtained from the recursion relation (p7|) with the initial condition h(2k, t = 0) h{2k+l,t = l) = 0. 
It will turn out that the parameter A of the Gumbel distribution is obtained from 

lim Z{\-L) = 1. (21) 

L^OQ 
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Note, that this condition reduces simply to Eq. in the case of gapless aUgnment, since for infinite gap cost we 
have 



L/2 



(exp[A/i(0, L)]) = (exp[A^s(0,2fc- 1)]) = (cxp[As]) 



L/2 



(22) 
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FIG. 4. Global alignment lattice used for significance estimation, (a) shows the right half of the lattice from Fig. Q It can 
represent all possible paths of length L which end at the point (r, t) — (0, L) and start at (r, 0) for an arbitrary r. (b) shows 
such a path schematically. It represents the "rim" of an island with its high score denoted by the filled dot at the tip of the 
triangle. The open dot at (ro,to) represents the corresponding island initiation event. 



The key observation which leads to the result (^) and ( ^l|) is the fact that similar to the case of gapless alignment 
discussed in the last section, the points on the alignment lattice can be grouped together as islands Js^ . By the 
construction of the local alignment algorithm (|l8|), many points on the alignment lattice have a score of zero in 
the logarithmic alignment regime. As for gapless alignment, a positive score will be generated out of this "sea" of 
zeroes, if a good match occurs by chance. This positive score can then imply further positive scores via the recursion 
relation (18). For every point (r, t) on the lattice which has a positive score, we can define a restricted optimal path 

i{t), which is the highest scoring path out of all paths ^(t) with an end fixed at r{t) = r; see the example in Fig. 
The path must start at some point (ro,to) where a positive score is created out of the zero sea by a good match. An 
island is then defined to be the collection of points (r, t) with positive score, i.e., S{r,t) > 0, and whose restricted 
optimal path ^ (r) originates at the same point (rp, to)- A sketch of these islands is shown in Fig. Each of these 
islands has a maximum score which we denote by fife as we did in the gapless case. By this definition, every lattice 
point with a positive score belongs to exactly one island. Thus, the maximal score E on the total lattice is given by 
Eq. (|l^). Since large islands are well separated by a sea of points with score zero, they are statistically independent 
clusters on the alignment lattice. Thus, their maximal scores ak are again independent identically distributed random 
variables which yield a Gumbel distribution of S via Eq. ([T^). Our task is thus to calculate the distribution of the 
island peak scores ak in the presence of gaps. 
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FIG. 5. Sketch of some islands on the local alignment lattice. The lattice sites with a positive score are marked with dots. 
The bonds which have been chosen in the maximization process (^) are highlighted. Together they are the restricted optimal 
path associated with each point with a positive score. Each of these paths goes back to an island initiation event which is 
marked by an open dot. The large filled dots mark the positions of the highest scoring point on each island. 



This distribution of maximal island scores can be derived analogously to the gapless case (App. While a single 
gapless island is described by a random walk of some optimized length L, an island with gaps corresponds to a global 
gapped alignment of some optimized length L as the one shown schematically in Fig. ^(b). Using this replacement, 
the maximal island distribution again has an asymptotically Poissonian form ( pj]) with the decay constant A given by 
Eq. (^). An approximate interpretation for the result (21) is the following: Due to the choice of scoring parameters 
in the logarithmic phase of local alignment, the average score {h{0, L)) of global alignment with the same choice of 
parameters decreases linearly with the length L of the alignment. Thus, typical configurations of the disorder have 
a strongly negative score h{0,L) and hardly contribute to Z{X;L) — (exp[Aft.(0, L)]). Only on very rare occasions, 
h{0,L) is positive for large L and contributes significantly to Z{X;L). The fact that there is a choice of A with 
Z(A; L) = I for large L implies that these configurations with positive h(0, L) are exponentially rare. It is thus 
necessary to weight these configurations with the exponential factor exp[A/i(0, L)], and choose A to match the decay 
constant of the probability of finding such rare events. 

As already noted in the analogy between the directed polymer and sequence alignment, the score corresponds to the 
(negative of the) free energy. Thus the quantity Z{\\ L) = (exp[A/i(0, L)]) can be interpreted as the disorder-averaged 
(zero temperature) partition function|^of A "replicas" of a directed polymer of length L. Note that the replica number 
given by A need not be integer. In the surface growth interpretation, Z{\] L) is the generating function for the space 
averaged surface height. While many of the universal features of global and local sequence alignment (e.g., its scaling 
behavior in the logarithmic phase and upon approaching the phase transition line) can be understood merely from 
the knowledge that sequence alignment belongs to the KPZ universality class plH-R] or from the limit Z{\ — > 0;L), 
a solution of Eq. (^ ) for the non universal quantity A requires the knowledge of the large L behavior of the entire 
function Z{\; L) and hence a more detailed microscopic calculation for the given model. 



IV. GLOBAL ALIGNMENT AS AN ASYMMETRIC EXCLUSION PROCESS 



'^However, Z[\; L) should not be interpreted as the partition function at temperature 
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A. A Simple Model of Sequence Alignment 



From now on we will focus on global alignment as described by Eq. ( p^ ) , and use Eq. (|T]) to infer the value of the 
parameter A characterizing local alignment. In order to simplify the presentation, we restrict ourselves here to a very 
simple scoring system. As will be discussed in latter sections, our formulation as well as some of the results can be 
generalized to the more complicated scoring systems. 

We will study the scoring system in which the local similarity scores Sa.b can take on only two possible values, 

^a.b ^[ll^l ■ (23) 

Moreover we will choose the gap cost to be (5 = 0. With this choice of the scoring parameters, the score h has the 
additional interpretation of being the length of the longest common subsequence of the two sequences a and b. This 
longest common subsequence problem has a long history^ as a toy model for sequence comparisons [^Sj-^ . 

Additionally, we will neglect correlations between the local scores s{r,t), which arise from the fact that all M x N 
local scores are generated by the M + N randomly drawn letters. Instead of taking these correlations into account, 
we will assume that s{r,t) — r]{r,t) with independent random variables r]{r,t) given by 

r,(r t)-l^ "^'^^ P'°^^^- P (24) 

V[r,i) - I Q ^j^j^ probab. 1-p ^ ' 

with 

Pr{V,,t ?7(r, t) = rjr,t} = J] Mv{r, t) = Tjr,t}- (25) 

r,t 

To model sequences randomly drawn with equal probability from an alphabet of size c, we take p ~ 1/c. The 
approximation (^5|) is known to change characteristic quantities of sequence alignment only slightly We will 
confirm numerically at the end of this paper, that this also holds for the values of A which we are mainly interested 
in here. For our choices of parameters, the global alignment algorithm ( p^ ) reads 

h{r, t + max{/i(r, t - I) + r]{r, t), h{r + 1, i), h{r - 1, t)}. (26) 



B. Choice of the alignment lattice geometry 

In order to handle finite size effects better, we will use a rectangular geometry (Fig. ^ for the alignment lattice, 
instead of the triangular geometry shown in Fig. |^(a). We will further apply periodic boundary condition to the top 
and bottom edges of the lattice, i.e., /i(0, t) — h{2W, t) for a rectangular lattice of width 2W, and will start on the left 
edge with the initial conditions h{2k + l,t = 0) = h{2k, t = 1) = 0. Note that despite the different lattice geometries, 
the score h{r,t) for all points with t < W on the rectangular lattice will be identical to the score at the same {r,t) 
coordinate on the triangular lattice^; see Fig. |[ 



^The longest common subsequence model is in the limit of an alphabet size equal to the sequence length also related to the 
longest increasing subsequence model which has been of recent interest in connection with surface growth processes [ pi] , 
^ Since directed polymers in a random medium are known to have a wandering exponent = 2/3 this actually still holds for 
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Q/^N^^ ^ 1^ 

FIG. 6. Rectangular alignment lattice of width 2W with periodic boundary conditions in the spatial (vertical) direction. We 
use this lattice instead of the triangular lattice shown in Fig. |^(a) in order to simplify the handling of finite-size effects. As 
indicated by the thick gray lines, the score at a point with t < W as the one at the tip of the triangle is identical with the 
corresponding score calculated on a triangular lattice as the one shown in Fig. ^a). 



C. The dynamics of sequence alignment as an asymmetric exclusion process 

In this section we will perform a change of variables on the sequence alignment algorithm ( p6| ) for the rectangular 
lattice shown in Fig. ^. We will find that the resulting problem is equivalent to an asymmetric exclusion process on a 
one-dimensional lattice of width 2W. As a guidance towards the choice of suitable variables, we take the knowledge 
from the (continuous) KPZ equation that the gradient of the surface height is an especially simple quantity. At a 
fixed time, the gradients at different positions become uncorrelated and Gaussian distributed Thus, we will 

look at their discrete analogs in the alignment problem. They are the score differences between neighboring lattice 
points and thus located on the diagonal bonds of the lattice. We will parameterize these score differences by the 
bond variables n{r,t). They will later turn out to be the occupation numbers of the sites of an asymmetric exclusion 
process. With the choice of coordinates as illustrated in Fig. 0(a), we define them to beQ 

f h(r+l,t)- h(r, t + 1) + 1 for r + i even 
n{r,t)=l (27) 
\ h{r + l,t+l) - h{r,t) forr + iodd 

As explained in detail in App. ^ rewriting the time evolution equation ( p6| ) in terms of the variables n{r, t) leads 
to a time evolution equation of n{r, t) alone, without any reference to the absolute scores h{r, t). Moreover, this time 
evolution equation implies that the score differences take only the values n{r,t) £ {0,1}. By the structure of the 
alignment lattice as a composition of elements as the one shown in Fig. 0(a), the resulting time evolution for the 
n(r,t) transforms a pair (n(r — l,t — l),n(r,t — 1)) £ {|00), |01), |10), |11)} into the new pair (n(r — l,t),n{r,t)) G 
{|00), |01), |10), |11)} independently from all the other n{r',t — 1). This transformation only depends on the single 
random variable rjir, t) and can be expressed by the transfer matrix 

10 
1 . 

in the basis |00), |01), |10), |11). We can thus interpret the action of the lattice element shown in Fig. |^(a) as a 
"device" like the one shown in Fig. |^(b) which takes a pair of variables {n'i,n'2) as its inputs, applies the transfer 
matrix Ti(0), and generates a new pair of variables (ni, 722) as its outputs. 



*Note, that the n(r, t) are not literally score differences but suitably chosen parameterizations of these score differences. This 
complication is necessary in order to enable the interpretation as the particle occupation numbers in the asymmetric exclusion 
process. 
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h(r+l,t) 




"2 



h(r-l,t) 



(a) ^ (b) 

FIG. 7. One building block of the alignment lattice. By our numbering scheme of the lattice r and t are either both even or 
both odd. (a) shows the scores at the lattice points and the bond variables n{r,i). (b) shows this building block as a "device", 
which takes two incoming bond variables n'l and n'2 and transforms them with the help of the transfer matrix To into the new 
bond variables n\ and n2. 



We recognize the action of the transfer matrix Ti(0) as the elementary time step of an asymmetric exclusion 
process, if we interpret the n(r, t) as particle occupation numbers on a one-dimensional lattice of 2W sites with 
periodic boundary conditions as the one shown in Fig. ^. Each of these sites can either be empty or occupied by a 
single particle. In each time step for each pair of neighboring sites, a particle hops to the right with some probability 
1 — p, if the site to its right is empty according to the non vanishing entry |10) — > |01) of the transfer matrix Ti(0). 
If there is no particle or if the site on the right is already occupied the configuration remains unchanged. 




12 ^ 2w-l r 

FIG. 8. Interpretation of the transfer matrix Ti(0) as given in Eq. (Kq) as an asymmetric exclusion process. A configuration 
of the local score differences is represented by particles on a one-dimensional lattice of width 2W . At an odd time step for each 
even site r — 1 a particle hop is attempted with probability 1 — p. In our example, the particle at site cannot hop, since site 
1 is already occupied. The particle on site 2 can hop to site 3 as indicated by the dashed square. 



In terms of the elementary devices shown in Fig. |7|(b) the lattice structure of Fig. |6| can be depicted schematically 
as shown in Fig. ^. Thus, the process of hopping a particle to the right is attempted for each even numbered site at 
odd time steps and for each odd numbered site at even time steps. This hopping dynamics is exactly the asymmetric 
exclusion process with sublattice-parallel updating with periodic boundary conditions^ ||lo| , |39|] . 



^If we had chosen the "hard wall" boundary conditions h{ — l,t) — h{2W,t) = 00 instead of the periodic boundary conditions 
h{2W, t) — h{0, t) for the score, we would have arrived at the asymmetric exclusion process with sublattice-parallel updating 
and open boundary conditions at a feeding and extinction rate ofa = /3 = l— pat the two ends of the lattice with 2W — 1 
sites respectively. 
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FIG. 9. Schematic representation of the alignment lattice of Fig. g as an "electric circuit". The boxes represent elements 
of the type shown in Fig. |^b). They take two particle occupation numbers as their "inputs" and generate two new particle 
occupation numbers as their "outputs". Their interconnection into a layered structure as shown here with a shifted pairing 
scheme in every time step leads to the non-trivial behavior of sequence alignment. 



In reducing the dynamics from a dynamics of scores into a dynamics of the occupation numbers n(r, t), one has to 
pay attention to the boundary conditions. Periodic boundary conditions for the n{r, t) do not automatically lead to 
meaningful periodic boundary conditions for the scores h{r, t). We thus have to impose the additional constraint that 
the total sum of the local score differences across the whole lattice vanishes. In terms of our bond variables n(r, t) 
this translates into the condition 

2W-1 

V n(r,t) = -, (29) 

r=Q 

i.e., the system of hopping particles is at half filling. Since the number of particles is conserved under the dynamics 
described by the transfer matrix Ti(0), the condition ( p9| ) is guaranteed to hold if we choose the initial conditions 
Sr^~^ t — 0)/2W = 1/2. Particle densities different from one half would correspond to a tilted "score profile" 
h(r, t) at each fixed time t. 



V. THE GENERATING FUNCTION 



A. Expressing the generating function in terms of the hopping process 



We now want to apply the mapping between sequence alignment and the asymmetric exclusion process to the 
practical problem of assessing alignment significance. As noted in Sec. [II, this amounts to calculating the generating 
function 

Zo(A;A^)= (exp[A/^(0,iV)])o, (30) 



where (. . .)o denoted the average over the ensemble of uncorrelated disorder defined by Eqs. (^4|) and (^5|). Thus, we 
first need to express the total score h{0,N) in terms of the occupation numbers n{r,t). As explained in more detail 
in App. |b|, h{0,t) is on average incremented by 1/2W every time the transfer matrix Ti(0) is applied except for the 
transition |01) |10). Thus, Zo{X;N) can be expressed as 

Zo{X;N) = exp[AAr/2](exp[-AJ])o (31) 

in terms of the total number of particle hops per lattice site 

N/2W-1 

-^^^E E(j(2fc + l'2?-l)+j(2fc,20), (32) 

1=1 k=Q 

where j{r,t) E {0,1} is the number of particle hops at lattice site (r,t). We thus need to determine the generating 
function 
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Q{X;W,N)= {exp[-XJ]}o 



(33) 



for the asymmetric exclusion process. Note, that this is different from the generating function of the local current 
j(r, t): since J/N is the time and space averaged current, Q contains information on spatial and temporal correlations 
in the number of hopping particles which the generating function for the local current does not. 



B. The Generating function as an eigenvalue problem 



Now we will reformulate the calculation of the generating function Q(A; W, N) for the asymmetric exclusion process 
as an eigenvalue problem. As already mentioned, exp[— AJ] is a product of factors exp[— A/2VF] for every particle 
that hops. Since the dynamics of the hopping process is described by the transfer matrix Ti(0) defined in Eq. (|2^), 
we can calculate Q(A; W,N) by associating a weight exp[— A/2VK] to the element of the transfer matrix Ti(0) which 
corresponds to a hop. This can be derived more formally from a dynamics path integral representation of Z(){X; N) 
as detailed in App. |Q. We get the modified local transfer matrix 



Ti 



/I 













1 (1 


— p)e^ 2W 











P 





Vo 








1/ 



(34) 



in the basis |00), |01), |10), |11) of a pair of neighboring lattice sites. 

Next, we need to take into account the special lattice structure of Fig. ^. We note that at every even time step the 
lattice is decomposed into W of the building blocks described by Ti. Thus, a single time step of the total system at 
even time is described by the matrix 



w 



iTi 



fe=i 



(35) 



At odd times the dynamics is the same, but the pairing of neighboring sites is shifted. To generate the time evolution 
at odd time steps, we can thus shift all particles to the right, apply the dynamics of even time steps and then shift 
all particles back to the left. Let C be the translation operator such that 



C|noni 



■ n2w-i/ 



ni 



■ n2w~ino) 



(36) 
With this 



which shifts all particles by one site to the left taking into account the periodic boundary conditions 
definition we can write Todd — CTvi/(A)C^^. 

The sublattice-parallel updating procedure (i.e., the structure of the lattice as depicted by Fig. m finally leads to 



QiX;W,N) = (^i|(Te 



Ddd 



(^i|(Tw(A)CTw(A)C-i)^/2|^o) 



(37) 



where \ipo) is a 4^^ dimensional state vector representing the initial conditions, and {ipil is the dimensional vector 
whose entries are all 1, used here to denote a summation over all possible final configurations. In the limit of large 
N W, this obviously becomes 



Q{X;W,N)^p^{X) 



(38) 



where p^(A) is the eigenvalue of Ti^(A)CTvi/(A)C^^ with the largest real part. Since this matrix has no negative 
entries and is irreducible for non-pathological choices of the scoring matrix (while restricted to the physical sector of 
half filling), the largest eigenvalue of this matrix is guaranteed by the Perron Frobenius theorem to be non degenerate 
and real, and its eigenvector can be chosen without negative entries. When A = 0, we have p(0) = 1 and its eigenvector 
is the stationary distribution of the asymmetric exclusion process, which is a simple tensor product of independent 
occupation numbers. This is no longer the case for A ^ 0. 



C. Calculating the largest eigenvalue 

For a finite W, it is in principle possible to solve for the largest eigenvalue of the 4^ dimensional matrix 
Tt4a(A)CTvf(A)C~^ by directly diagonalizing the matrix. It is convenient to reduce the size of this matrix by ex- 
ploiting some symmetries. Since the lattice is translationally invariant with respect to shifts in r by 2, we expect 
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the same symmetry of the largest eigenvalue of Tvi/(A)CTn/(A)C ^. Thus, for the purpose of computing the largest 
eigenvalue we can restrict ourselves to the subspace C of translationally invariant vectors 



C 



{1^)1^21^-) = IV.)}. (39) 



This corresponds to a discrete Fourier transform of the matrix Tvf(A)CTvk(A)C"^ and choosing the fc = component. 
On C, we have = C by definition. Thus, it is enough to look for the largest eigenvalue pwW of the matrix 
T^(A)C restricted to C. A further restriction which helps reducing the size of the matrix is the mirror symmetry of 
the lattice which has to be respected by the eigenvector as well. Additionally, Tvk(A)C has to be restricted onto the 
physical subspace of half filling. 

After applying these simplifications, the largest eigenvalue can be calculated for small widths W using computer 
algebra. Although the matrix T\y(X)C explicitly contains the quantity exp[— A/2VK], it turns out that the characteristic 
polynomial depends only on exp[— A/2]. This is a consequence of the translational invariance of the lattice]^. In order 
to reveal the underlying structure of the largest eigenvalues for different W, it is very useful to expand the resulting 
largest eigenvalues pwW in powers of this quantity e^'*'/^. We get 



w 


= 1 : 


Pi (A) 


= Vp + 








w 


= 2 : 


P2(A) 




{p - l)e-^ - 


F0((e-^)2) 




w 


= 3 : 


P3(A) 




{p-l)e-^ - 


^ip-l)^{e-if- 




w 


= 4 : 


P4(A) 




(p- l)e"^ - 


Kp-l)VP(e"^)'- 


- {p- 1)\/P^(e 



where the 0((e^^/^)'^) terms denote terms of the given order with prefactors which are different for different W. We 
can see that the coefficients up to order (e~^^^)^~^ remain unchanged upon increasing W and they constitute the 
beginning of a simple geometric series. Assuming that this pattern holds for arbitrary orders, we can resum the series 
for any fixed A > and get 

p(A) ^ hm pwW = y^^' \ . (40) 
Combined with Eqs. (|3l|), (^3|), and ( ^8|) this yields the generating function 

(A \^ 
l + ^eM^] ^X\ ^^^^ 
l + ^exp[-f] 2 J 

in the limit of large N. 

A related generating function has also recently been calculated in the simpler case of a discrete space and 
continuous time asymmetric exclusion process. In this case, the full dependence on the finite width W can be 
computed. While Derrida et al. find that the generating function takes a universal form in the limit W oo 
with \W^^^ kept constant, the problem of assessing statistical significance of sequence alignments calls for the limit 
^ oo at a fixed A > 0. This limit goes beyond the universal regime. For the asymmetric exclusion process with 
sublattice-parallel updating it is given by our Eq. ( ^ 

Eq. (|4^) can be generalized to the match-mismatch scoring system given in Eq. (^) with a gap cost 6 — jjL/2 
for an arbitrary value of p.. If we denote the score in this scoring system by h'{r, t) it is connected to the score h{r, t) 
of the scoring system with p = 5 = Q hy the simple global rescaling and shifting 

h\r,t) = {l + pi)h{r,t)-^t. (42) 



^Instead of looking at the average score ^(A'') = h{r, N) as we do in the derivation of Eq. (|^ in App. y, we could also 

have chosen a specific position, say r = and r = 1, and monitored the behavior of the score ^(A'') = | [fe(l, A*') + /i(0, — 1)]. 
Since the differences between scores at the same time are bounded, these two qujintities must have the same generating function 
for large A'^. The transfer matrix which calculates the generating function for h{N) is T(A) = Ti(A) ® Ti(0) instead of 

Tvi'(A). It has the technical disadvantage that it breaks the translational invariance, but it explicitly depends only on exp[— A/2] 
instead of exp[— A/2W]. 
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Thus the corresponding generating function is given by 

Z(A, ^l■ N) EE (e^'''(o.^)) = e-A'^(e^(i+A')Mo,A^)). (43) 

If we again neglect correlations and use uncorrelated random variables 

r 1 with probab. p 
' ' 1 — ^ with probab. l—p 



the same rescaling and shifting leads to 

l + VPCxp[-|(l + Ai)] 2 



VI. IMPLICATIONS ON DIRECTED POLYMERS AND SEQUENCE ALIGNMENT 

Now, we will study the consequences of our main result Eq. (|45|). First, we will discuss the general properties of 
the generating function and its implications on the physics of directed polymers in a random medium. Then, we will 
come back to our original question of the assessment of sequence alignment significance. We find, that Eq. (^5|) is 
an explicit expression for the significance assessment parameter A. It reproduces known limiting cases and we will 
demonstrate that our result agrees well with numerical simulations. 



A. Properties of the generating function 

The most notable property of the generating function of the connected moments of the average score (or average 
height) 

log(exp[A/i'(0,iV)])o = logZo{X,^i■,N) (46) 
is that it is an odd function of A. The first two terms of its expansion are 

log Zo(A, N) ^ ^^^^^ ^ 1 , ^ ^^^^ 



where 



N 6 



[Zo(A, ^; N)]^ = -f + (1 + /i)T^ (48) 



and b{iJ.) = ^ 1+^ ^ (1 Vp)Vp > already mentioned, we can regard the generating function Zq{X, fi; N) as the 



ensemble averaged partition function of A replicas of a directed polymer in a random medium. In this sense, Eq. (|47^ 
is the free energy per length of this A replica system. It has the same form (with a vanishing quadratic term) as the 
result of an earlier explicit replica calculation in continuous time and continuous space p^ . However, our analysis is 
directly on the discrete model and is not plagued by the difficulty of taking the continuum limit in JT2| . Moreover, 
we do not have to rely on a questionable analytic continuation in the replica number since our calculation is valid for 
an arbitrary A > from the very beginning. 

The vanishing of the second order term in A will not even be affected by the universal contributions to our result for 
small A which have been found in using the explicit dependence on the width VF, since its second order coefficient 
vanishes as W~^^-^ in the limit of large width. The consequence of this vanishing second order term in A is that the 
second connected moment of the average height, i.e., the height fiuctuations, scales sublinear in N. Instead the third 
moment of the height fiuctuations scales linearly with N. This is a signature of the presence of the anomalous iV^/^ 
fluctuations of the average surface height characteristic for the KPZ universality class. 
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B. Statistical significance and tlie log-linear transition 



According to Eqs. ( pT| ) and ( p5| ) the parameter A which characterizes the statistical significance of local alignments 
with the match- mismatch scoring scheme Eq. and gap cost S = is given by the unique positive solution of the 
equation 

1 + Vpcxp[|(l + ^)] A 



exp[--M] - 1. (49) 



l + VPCxp[-|(l + Ai)] 2 

In the limit of large /i, the solution of Eq. ( ^9| ) converges to A = — logp. This is the value which we expect since this 
limit corresponds to the case of gapless alignment (recall that (5 = /i/2 here), and A = — logp is the solution of the 
large fi limit of Eq. (Q). If the gap cost is decreased, A is reduced, too. At some critical value of /i there will not 
be any positive solution of Eq. ( |4S| ) any more, i.e., islands of all sizes are equally probable. This indicates a phase 
transition between the logarithmic and the linear alignment phase. The approach of this phase transition is especially 
interesting. 

Close to the phase transition, we can use the expansion (B^) and rewrite Eq. (Eot) as 



viii)X + h{^i)\'^ + O{X^)^0. (50) 
6 

From this expansion the origin of the phase transition is very clear: If f (/i) > 0, the right hand side of Eq. ( pO| ) is 
a monotonously increasing function of A. Thus, A = is the only solution of Eq. (|5^). This corresponds to a flat 
distribution of island sizes, i.e., the linear alignment phase. If i;(/i < 0), the shape of the right hand side of Eq. ( [50| ) 
changes and there are three roots, one of which is the positive solution 

This indicates that we are in the logarithmic alignment phase. Thus, the phase transition occurs at the critical 
mismatch cost which is defined by the condition 

vi^ic) = 0. (52) 
Using the expHcit form (Q of v{iJ,), we get the critical mismatch cost 

= (53) 

This reproduces the already known result for the phase transition point of this model. As the mismatch cost /i 
approaches this critical value from above, A vanishes as 

In the case of finite width W, the above expression is valid down to A ~ W^^. This confirms the characteristic 
universal power law |/i — /^d^^^ proposed previously B by scaling arguments. 



C. Numerical Verification 

In order to test the approximation of uncorrelated local disorder ( p5| ) and the heuristic elements of the derivation 
of Eq. (^9[), we performed extensive numerical simulations to corrobate our result. We used the DNA alphabet of size 
c = 4 with identical frequencies for all four letters, i.e., p = 1/4. For different choices of the mismatch cost /x with 
corresponding gap cost S = fi/2, we used the island method to find the values of A as a function of /i numerically. 
For each value of 6 several billion islands have been generated using sequences of iV = 25,000 in order to achieve 
relative errors of approximately 1%. We used completely uncorrelated local scores chosen as 

s(r t) = l ^ ^^^^ probab. p 
^ ' ' [ —fi with probab. I — p 
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with p= 1/4. The resulting values of A are shown in Fi g. | To| . The solid line is the solution of Eq. (|4^) and the circles 
represent the values of A for uncorrelated local scores (^sf). As shown in Fig. |l0| the observed A's follow the analytic 
solution very closely, thereby confirming Eq. (^). We also included the values of A which result from correlated local 
scores generated from aligning randomly chosen sequences according to Eq. (^. As one can see, they deviate only 
slightly from the analytical result for uncorrelated disorder. This deviation is strongest close to the log- linear phase 
transition, which for uncorrelated disorder happens at jic = 2. The difference of ~ 2% in /ic between the correlated 
and the uncorrelated case rapidly becomes much smaller for larger alphabet sizes c . 








1 



3 4 5 

|a=25 
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FIG. 10. Dependence of the significance parameter A on the scoring parameter jj,. The circles represent the numerically 
obtained values of A for uncorrelated local disorder (^5|) with match probability p — 1/4 for which Eq. (^) (the solid line) has 
been derived. They agree well with the analytical result. The diamonds correspond to local disorder generated by comparing 
two randomly chosen sequences over an alphabet of size c = 4. The values of A obtained from the two ensembles differ from 
each other only very close to the phase transition point /ic ~ 2. 



VII. MORE GENERAL SCORING SYSTEMS 



While the approximation of the ensemble of random sequences by the ensemble of independent local scores appears to 
have negligible effects, our treatment is so far limited to the special scoring system Eq. (gj). While the computation of 
the generating function (exp[— AJ])o seems feasible only for this special scoring system, the mapping to an asymmetric 
exclusion process and the reformulation as an eigenvalue problem is still possible for more general scoring systems . 

We consider here scoring systems satisfying the following two conditions: First, the differences between the possible 
values Sa,b of the scoring matrix are multiples of some score unit A. Second, the gap costs S is such that 26 + sq is 
also an integer multiple of A, with 

So = niax{sa,fc} (56) 

a,b 

being the maximal entry of the scoring matrix Sq These two conditions are easily satisfied (with A = 1) by the 
most frequently used protein scoring systems p^ , p^ which use integer scores and gap costs for performance reasons. 
For the match-mismatch scoring system (||), the first condition is satisfied with A = 1 + M, while the second condition 
applies only to a discrete set of S's. However, it is possible in principle to interpolate to arbitrary gap costs p^ . 

Mapping to an asymmetric exclusion process is possible for scoring systems satisfying the above two conditions. It 
will be convenient to express the gap cost 6 in the following way. 
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25 



<A - So with rimax G N. 



(57) 



As before, we shall ignore correlations between the local scores s{r,t) and introduce uncorrelated random variables 
r]{r, t) £ {0, 1, . . .} such that 



I.e., 



with 



s{r,t) = So- rjir, t)A, 
Pr{Vr,t??(r,i) = ryr.t} = nPr{»7(»":i) = Vr,t} 



(58) 
(59) 

(60) 



a,b 



Note, that these random variables ry(r, t) only take on a finite number of different positive integer values, since the 
scoring matrix Sa.b itself has only a finite number of entries. 

A derivation analogous to the one given above for the longest common subsequence problem again maps the 
dynamics of the alignment algorithm onto the dynamics of particles on a one-dimensional lattice. The state of the 
system is still given by the number of particles n(r, t) at each lattice site, but now these occupation numbers are 
defined as 



n{r,t) 



^[h{r + l,t)- h{r,t + I) + 5 + sq] r + t even 
-^[h{r + l,t + l) - h{r,t) + S] r + t odd 



(61) 

(62) 
(63) 

(64) 
(65) 



and can take any integer value between and rimax. The dynamics is given by the relations 

n(r — 1, t) = n{r — 1, < — 1) — j{r, t) and 
n{r,t) = n{r,t- l)+j{r,t) 

for even r + t, where 

j(r, i) = mm{r/{r,t),n^i,^ - n{r,t - l),n{r- l,t- 1)} 
and the total number of particles is fixed to be 

2W-1 

— V n(r t) = 
2W ^ ' ' 2 ■ 

Eqs. (|6^)-(|6^) can be equally expressed as the following cellular automata: For each time step and for each pair of 
neighboring sites of the one-dimensional lattice the particles live on, 

1. choose an integer number 77 > of particles to hop from site r — 1 to site r according to the distribution ( |60| ) 

2. if there are fewer particles than t] on site r — 1, then reduce 77 to the number of particles on site r — 1 

3. if there are fewer free spaces than 77 on site r, then reduce 77 to the number of free spaces on site r 

4. move r\ particles from site r — 1 to site r 
This updating rule is to be applied sublattice-parallel as for the simpler scoring system. The process is illustrated in 



Fig. [T|. 



mi 
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2w-l 



18 



FIG. 11. Interpretation of Eqs. (|62|)-(|6^) as a generalized asymmetric exclusion process. A configuration of the local score 
difi'erences is represented by particles on a one-dimensional lattice of width 2W. Each lattice site can accommodate up to rimax 
particles (here Timax = 4.) At an odd time step for each even site r — 1, a number of particles is chosen to attempt hopping 
to the right. If there are enough particles at site r — 1 and enough space on site r, the chosen number hops. In the example 
shown, the filled particles are the ones to hop and the dashed boxes show their positions after the time step. No particle which 
could hop is on site 6. The particle on site cannot hop since its destination site is already fully occupied. For site 2, one 
particle has been chosen. On site 4, at least two particles tried to hop. If the number chosen was larger, it would have been 
cut down to two since there are only two particles on site 4 and since there are only two spaces left at site 5. 



The more complicated hopping process is reflected in a different matrix Ti{X/W) without changing anything else 
in the calculations. Thus, the significance assessment constant A is still given by the generating function of the space 
and time averaged current as 

exp[Aso/2](exp[-AAJ])^ = 1 (66) 

but the calculation of this generating function for an arbitrary distribution ( |60| ) becomes much more difficult for the 
generalized asymmetric exclusion process than for the case rimax = 1 of the original asymmetric exclusion process. 

Already, the knowledge of the dependence of the average current on the scoring parameters would be very helpful 
to biologists, since this determines the position of the log-linear phase transition. As discussed in the case of the 
simpler scoring system, the phase transition occurs, if the first moment of the score distribution vanishes, i.e., for 



exp[Aso/2](exp[-AAJ]) ~ = so/2 - = so/2 - {j)oA (67) 



The average current is much easier to calculate, since in contrast to the generating function, it is independent of 
temporal correlations. Thus, it can be calculated from the knowledge of the stationary state alone. For the original 
asymmetric exclusion process, the occupation numbers of the stationary state become independent random variables. 
For the generalized asymmetric exclusion process presented here, this is not the case any more. If the number of 
particles which hop in one move is at most one (as for the scoring system (0) with arbitrary gap costs) approximating 
the stationary state as a product state still yields reasonable values of {j)o and hence the phase transition point 
((5c, /ic) Nevertheless, exact results or at least systematic improvements taking into account the spatial correlations 
of the occupation numbers would be desirable. For the more general case allowing for an arbitrary number of particles 
to hop at a given time, no analytical result is known. 



VIII. CONCLUDING REMARKS 



In this paper, we have shown how a question of great practical importance to molecular biologists, like the signifi- 
cance assessment of local sequence alignment results, can be answered by studying the asymmetric exclusion process, 
an exactly solvable model of the KPZ universality class. Conversely, in trying to answer this question for biologists, 
we derived an important physical quantity like the generating function Z for the corresponding physical system in 
discrete time and discrete space. This complements the existing solutions in continuous time and space [ p^ and in 
continuous time and discrete space ||l^ . Our result is the first successful analytical approach to assessing the statistical 
significance of sequence alignment with gaps. 

Future work of p ractical importance includes solving the generalizations of the asymmetric exclusion process de- 
scribed in Sec. VI] and studying the effect of the widely used "affine gap cost" , where a contiguous gap of length £ 
is assigned some gap cost S + {£ — l)e instead of simply Si. A general expression which gives A as a function of an 
arbitrary scoring system should finally give rise to a deeper understanding of the role of the gap cost and lead to 
better choices of scoring systems for alignments of biological sequences. 
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APPENDIX A: ISLAND HIGH SCORE DISTRIBUTION 



In this appendix we derive heuristically the Poisson distribution of maximal island scores. We first treat the gapless 
case and then generalize the derivation to alignment with gaps. In the gapless case, the distribution of large 
islands of length L measured from their beginning to their peak point at height a is given by 

L 

= (Al) 

i=l 

Using the Fourier representation of the delta hmction and the statistical independence of the s{i) this yields 

p{a) = 2^ y" exp(-ifc(T)(exp(ifcs))-^dfc. (A2) 

If we assume that the peak score of the island is proportional to its length, i.e., that an island has on average a linear 
slope a, we get 

p{(t) = 77- / exp(— ifcQ;L)(exp(ifcs))^dfc, (A3) 



27r _ 

which can be evaluated in a saddle point approximation as 

p{a) exp(-A(T) (A4) 

with 

X = ik* - log[(exp(iA:*s))]/Q:. (A5) 
The saddle point k* is given by the saddle point equation 

(s exp(i/c*s)) 



(exp(ifc*s))a 



= 1. (A6) 



This k* is itself a function of the so far unknown slope a. To find the correct value of a, we minimize Eq. (A5) with 
respect to a and get together with Eq. (A6) 

{exp{ik*s)) = 1. (A7) 

Inserting this into Eq. (A5) yields condition (Q). Additionally we get from Eq. ( |A6| ) the typical slope a of an island 
as 

a = (sexp(As)). (A8) 

For alignment with gaps, the high score of an island of length L from its beginning to its peak point is not just the 
sum of local scores any more. Instead, it is given by the final score h{0, L) of a global alignment of two sequences of 
length L taking into account all possible insertions of gaps. We can still use the Fourier transformation to get 

p(cr) = {6{a - h{0, L))) = ^ J exp{~ika){exp{ikh{0, L)))dk. (A9) 

In Sec. |VB| we will see, that (exp(A/i(0, L))) is for large L the L'th power of the eigenvalue of some matrix. We thus 
define p(A) by 

(exp[A/i(0,L)]) =p^(A) (AlO) 

and again assume a linear slope a of the islands which we conveniently define by cr = aL/2 in order to take into 
account that the lattice of length L actually only contains L/2 matches or mismatches in a row. We then get 

p(cr) = — / cxp[{-ika/2 + log p{ik))L]dk. (All) 
27r J 

Applying the above saddle point approximation and maximization with respect to the slope of the island a yields 
Eq. (^). Moreover it gives the typical slope of an island as 

a = 2^ = |(/j(0, L) exp[XhiO, L)])). (A12) 
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APPENDIX B: EXPRESSION OF THE SCORE DYNAMICS IN TERMS OF PARTICLE OCCUPATION 

NUMBERS 



In this appendix we describe the mapping from the evolution equation ( |2^ ) of the sequence ahgnment scores onto 
the asymmetric exclusion process with the n(r, t) as the particle occupation numbers in detail. To this end we apply 
Eq. ( p6| ) to the definition Eq. ( |27| ) of n(r, t), where we assume by convention that r + 1 is even as in Fig. |(a). We get 

n{r-l,t) = h{r,t + 1) - h{r - l,t) 

= niax{/i(r,t- l) + ry(r, t),h{r - \,t),h{r + l,t)} - h{r - \,t)} 

= h{r,t-l)-h{r~\,t) + \ + mdjyL{'q{r,t)-l,h{r-\,t)~h{r,t-l)-l,h{r+l,t)-h{r,t~t)~l} 

— n{r — 1, t — 1) + max{ry(r, t) — l, — n(r — 1, l).n(r, i — 1) — 1} 

— n{r — 1, t — 1) — min{l — 7y(r, t),n(r— 1), 1 — n(r, 1)} 

and analogously 

n(r, t)=h{r + \,t)- h{r, t + 1) + 1 

= h{r + l,t) - max{/i(r, t - I) + i]{r, t), h{r - l,t), h{r + 1, t)} + 1 

= h{r + l,t)-h{r,t-l)-max{ri{r, t)-l,h{r'-l,t)-h{r,t-l)-l,h{r+l,t)-h{r,t-l)-l} 
= n(r, i — 1) — max{?7(r, t) — 1, —n{r—l, t—l),n{r,t—l) — l} 
= n{r,t — 1) + min{ 1 — ry(r, i),7i(r— 1, 1), 1 — n(r, <— 1)}. 

This can be summarized in the form 

n(r- = n(r - l,t - 1) - j(r,i) and (Bl) 

n{r,t)=n{r,t-l)+j{r,t), (B2) 

where 

j{r, t) = min{l - ri{r, t),l- n{r, t - l),n{r ~ l,t ~ 1)}. (B3) 

As we can see, there is no reference to the actual alignment scores h{r, t) in these equations. As a first consequence 
of these equations we note that they imply that the variables n(r, t) can only take on the values zero and one. This 
is obvious by induction, if it is fulfilled at t = as it is the case for our choice of initial condition^ Thus, it is 
reasonable to interpret the n{r, t) as particle occupation numbers. 

Moreover, we note that a pair of neighboring occupation numbers {n(r — 1, t), n{r^ t)) at time t depends only on the 
corresponding pair (n(r — 1, i — 1), 7i(r, t — 1)) at time t — 1 and the random variable r?(r, t). Thus, the elements as 
the one shown in Fig. ^ perform these transformations of a pair of neighboring occupation numbers into a new pair 
of neighboring occupation numbers completely independently from each other. 

Looking at Eqs. (pll)-(p3|) more closely, we see that j(r,t) = whenever (n(r-l,t-l), n{r,t-l)) e {|00), |01), |11)}. 
Thus, (n(r — l,t),n(r, i)) = (n(r — l,t — l),n(r, t — 1)) in these cases. Only if site r — 1 is occupied and site r is 
empty, the number j{r, t) of transfered particles can be one with probability Pr{'ri{r, t) = 0} = 1—p. This leads to the 
interpretation of the dynamics given by Eqs. ( [Bl| )-( p^ ) as an asymmetric exclusion process described by the transfer 
matrix Ti(0) defined in Eq. ( P8|) of the main text. 

So far we transformed the dynamics of the sequence alignment algorithm as given by Eq. (^6|) into an asymmetric 
exclusion process. We still have to express Zo{X; N) in terms of this asymmetric exclusion process. To achieve this, 
we first define for any "time" t the average score (or space-averaged surface height) 

^El^o'[M2A:,i-l) + /i(2fc+l,i)] t even 



2W 

^ ■EZro'[H^k,t) + h{2k + l,t-l)] todd 



V 2W 



^Even if the initial values of the n{r,t = 0) are not zero or one they will under the dynamics Eqs. (B1)-(B3) eventually try to 
take on values less than zero or larger than one. The minimum in Eq. (Be) then resets them to zero or one. Thus, after some 
startup phase, the n(r, t) will be integer even if their initial values are chosen to be non-integer. 
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Because of the translational invariance of the system in the spatial (r) direction we get 

Zq{X; N) = (exp[A/^(0, iV)])o = (cxp[A7I(iV)])o. (B5) 

Thus, we can restrict ourselves to calculating the large N behavior of the latter quantity. 

The change in the average score h{t) can be expressed in terms of the occupation numbers n{r,t) via Eqs. ( |2^ ) 
and (|27|). It is given by 

S« + 1) - MO 4 * ' ^ " - ' - ' . (B6) 

I 5IFEE!.o'[M2l:+l.< + l)-''(2t + l,(-l)l (odd 

The local score differences in this equation can for even r + t he expressed as 

h{r, t + 1) - h{r, t - 1) = max{ /i(r, t - I) + r]{r, t),h{r + 1, t), h{r - 1, t)} - h{r, t - 1) 
= 1 + max{?7(r, t) — 1, n(r, t — 1) — 1 — ri(r — 1, < — 1)} 
= 1 — min{l — rj{r, t), 1 — n{r, f — 1), rt(r — 1, < — 1)} 
= l-]{r,t). 



Inserting this into Eq. (B6) yields 



- 1 1 f Ero'i(2^'0 t even 
h{t + l)-h{t)^-~ — { ^''-'^ . (B7) 



Combining Eqs. (B5) and (B7) finally yields 



N-l 



Zo(A; N) = (exp[A/i(7V)]>o = (exp[A ^ + 1) - /;,(t))])o (B8) 

t=o 

= cxp[A7V](cxph— ^ ^ (j(2fc + 1, 2/ - 1) + j(2fc, 20)])o 

1 = 1 k=0 



exp [A A^] (exp [— A J] ) , 



where 



N/2W-1 

^ ^ E E + 1,2/-!)+ j(2A:, 20) (B9) 

1=1 k=0 

is the total number of particles hopped divided by the number of sites. This is Eq. of the main text. 



APPENDIX C: DYNAMIC PATH INTEGRAL REPRESENTATION 

In this appendix we want to show that the generating function Q(A; W, N) can be expressed as a product of some 4^ 
dimensional matrices as stated in Eq. (|3^) in the main text. This rewriting is crucial in transforming the calculation 
of the generating function into an eigenvalue problem. We start from the definition 

N/2W-1 

Q(A; W, N) = (exp[-AJ])o = (JI 11 e-wJ(2fe+i,2/-i)g-5^j(2fe,2Z)^^^ (^1) 

1 = 1 k=0 

Since, the number of particles in each bin must be either or 1 at any time, we do not change the expectation value, 
if we introduce ones of the form 

2W-1 

1 = E n ^n{r,t),n^,t (C2) 

{nr,t}e{0,l}^"' '■=0 
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at each fixed time t. Ttiis corresponds to a path integral formulation of the quantity Qi\; W, N) and yields 



(exp[-AJ])o = (C3) 

2W-1 N/2 /2W~1 \ /W-1 \ 

{rir-.o} {n,.,N} '•=0 1=1 \ r=0 / \ k=0 / 

/2W-1 \ /W-1 \ 

\ r -0 / \ k=0 ) 

Once a configuration of the particles at each time step is fixed, the expectation value can be factorized into the parts 
which contain only a single random variable r](r^ t) 

2W-1 N/2 /2W-1 \ /W-1 \ /2W-1 \ /W-1 \ 

( n '^"(.,o),„..o n n n e-ww-^'-^) n n e-w.^^M^ )„= 

r=0 1=1 \ r=0 J \ k=0 / \ r=0 / \ 1=0 ) 

2W-1 

W ^n(r,0),nr,o ^ 
N/2W-1 

X H H ('^n(2fc,2i-2),n2fc,2,_2'^n(2fc+l,2i-2),n2fc + l,2l-2'3"^''*'^''^^'^' <5„(2fc,2/-l) ,n+2fc,2/- 1 '5n(2fc,2/-l) ,n2fc+i,2! - 1 )o X 
Z=l fc=0 

X Jl ((^71(2^-1, 2!-l),n2fc-l,2l-l'^"(2fc,2;-l),n2fc,2;-ie~^-'''^'''^'^<5„(2fc-l, 20, «2fc-l,2!'^n(2fe,2/).n2fc + l.2jo X 1. 
fc=0 



Inserting this into Eq. (C2) we can interpret the summation over the possible configurations of the particles at each 
time step as the summation of inner indices in a matrix multiplication. In this language the first term Or^~^ ^n{rfl),nr a 
is a vector on the 4^^ dimensional vector space indexed by all possible particle configurations. This vector has exactly 
one non vanishing entry at the configuration which is chosen as the initial configuration at < = 0. This non vanishing 
entry is one and we call this vector |V'o)- The factor of one which we added for the sake of clarity also plays the role 
of a vector the entries of which are all one. We call this vector {ipi\. All the other factors represent matrices. There 
is one matrix for every time step and each of these matrices is a tensor product of W identical matrices describing an 
elementary hopping process. Their matrix elements are 

(Ti (777)) = (<5n(r-l,t-l),n;'^n(r,t-l).n;j exp[-— — j(r, i)](5„(r_l,f).„i(5„(r,t),„2)o- (C4) 

V //(ni,n2),(«'i,ni) 

The disorder average here is over one single random variable ri{r, t). Performing this disorder average yields the matrix 
Ti(A/VF) as defined in Eq. (^J). The matrices for even time steps and the matrices for odd time steps are shifted 
against each other by one lattice unit which finally leads to the expression of Eq. (^7|) . 
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